
Update bos_token #4806


Closed
wants to merge 1 commit into from

Conversation

aacedar
Contributor

@aacedar aacedar commented Jul 2, 2025

PR type


PR information

  1. For the Qwen model, `encode` does not output special tokens such as `<|im_start|>` regardless of whether `add_special_tokens=True/False`; those special tokens are added when the prompt pieces are concatenated, so no extra handling is needed. We therefore need to check whether `bos_token` is empty.
  2. If `bos_token` is not empty, special tokens do need to be added. However, `bos_token = all_tokens[:idx]` yields a list, so `prompts_text.append(''.join(res_context_list))` fails because `res_context_list[0]` is of type list. Moreover, that `bos_token` holds token ids rather than token strings, which would also break a later encode, so using `self.tokenizer.bos_token` is more appropriate.
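The failure mode described above can be sketched in isolation (the list contents below are hypothetical; in ms-swift, `res_context_list` comes from `template._swift_encode`):

```python
# Hypothetical res_context_list: the first element is a list of BOS token ids
# (what `bos_token = all_tokens[:idx]` produces); the rest are prompt strings.
res_context_list = [[100000], 'User: hello\n', 'Assistant: ']

# What _apply_chat_template_to_messages_list effectively does; this raises
# "TypeError: sequence item 0: expected str instance, list found".
try:
    prompt = ''.join(res_context_list)
except TypeError as exc:
    print(exc)

# Using the BOS token *string* (self.tokenizer.bos_token in this PR) keeps
# every element a str, so the join succeeds.
bos_token = '<|begin_of_sentence|>'  # placeholder for self.tokenizer.bos_token
prompt = ''.join([bos_token] + res_context_list[1:])
print(prompt)
```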

@hjh0119
Collaborator

hjh0119 commented Jul 2, 2025

Nice catch!

Should we modify the function _apply_chat_template_to_messages_list to make it compatible with eos_token?

For example:

    def _apply_chat_template_to_messages_list(self, messages_list: InputsType):
        prompts_text = []
        for messages in messages_list:
            InferRequest.remove_response(messages)
            template_inputs, _ = StdTemplateInputs.from_dict({'messages': messages})
            res_context_list, _, _ = self.template._swift_encode(template_inputs)
-           prompts_text.append(''.join(res_context_list))
+           prompts_text.append(''.join(elem for elem in res_context_list if isinstance(elem, str)))

        return prompts_text

@aacedar
Contributor Author

aacedar commented Jul 2, 2025

@hjh0119 I have read the full `self.template._swift_encode(template_inputs)` function; its return value is a list. If an element of `res_context_list` is itself a list, the result is not correct, so we should raise an exception here instead of ignoring the error. That is, rather than checking `isinstance(elem, str)`, I think we should just throw an exception to alert the user.

@aacedar
Contributor Author

aacedar commented Jul 3, 2025

If an element's type is list, that means it holds token-id values, not text. For example, `tokenizer.encode("<|begin▁of▁sentence|><|end▁of▁sentence|>")` returns a list of token ids, even though the input is of type str before `tokenizer.encode()` is called. Every element of `res_context_list` should be of type str, and the list should then be encoded uniformly in subsequent processing.
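The distinction can be sketched with a toy tokenizer (the class and vocabulary below are made up for illustration; the real code uses `self.tokenizer`):

```python
# Toy tokenizer illustrating why token ids and token strings must not be mixed.
class ToyTokenizer:
    bos_token = '<BOS>'                 # a str: safe to ''.join with other strs
    _vocab = {'<BOS>': 0, 'hello': 1}

    def encode(self, text):
        # encode() turns text into a list of integer ids
        return [self._vocab[tok] for tok in text.split()]

tok = ToyTokenizer()
ids = tok.encode('<BOS> hello')
print(ids)            # a list of ints: joining these with strings fails
print(tok.bos_token)  # a plain str
```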

@hjh0119
Collaborator

hjh0119 commented Jul 3, 2025

I believe the related issue arises because template_meta.prefix contains the BOS token as an integer, which leads to a join failure.

Therefore, I think ignoring integer values in the _apply_chat_template_to_messages_list method is a safer approach, since this method is only used for logging purposes and omitting the BOS token will not have any impact.

Are there any training scripts that fail to run on the main branch but work with this PR?

@aacedar
Contributor Author

aacedar commented Jul 3, 2025

Yes, the related issue arises because of `template_meta.prefix`; I completely agree with you on that point. You can see my other PR, #4813, which fixes the `template_meta.prefix` bug.

@hjh0119
Collaborator

hjh0119 commented Jul 3, 2025

So, are there any training/inference scripts that fail to run on the main branch but work with this PR?

@aacedar
Contributor Author

aacedar commented Jul 3, 2025

Just replace `your_own_path` in `--model` and `--output_dir`, then you can test:

    CUDA_VISIBLE_DEVICES=7 \
    swift rollout \
      --model your_own_path/deepseek-ai/deepseek-coder-6.7b-base \
      > logs/log-grpo-rollout.log 2>&1 &

    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
    NPROC_PER_NODE=6 \
    swift rlhf \
      --rlhf_type grpo \
      --model your_own_path/deepseek-ai/deepseek-coder-6.7b-base \
      --reward_funcs accuracy \
      --use_vllm true \
      --vllm_mode server \
      --vllm_server_host 127.0.0.1 \
      --vllm_server_port 8000 \
      --train_type full \
      --torch_dtype bfloat16 \
      --dataset AI-MO/NuminaMath-TIR#1000 \
      --split_dataset_ratio 0 \
      --max_completion_length 512 \
      --num_train_epochs 1 \
      --per_device_train_batch_size 2 \
      --learning_rate 1e-6 \
      --gradient_accumulation_steps 2 \
      --save_total_limit 2 \
      --logging_steps 1 \
      --deepspeed zero2 \
      --max_length 4096 \
      --warmup_ratio 0.05 \
      --dataloader_num_workers 2 \
      --dataset_num_proc 4 \
      --num_generations 6 \
      --temperature 0.9 \
      --top_p 0.9 \
      --top_k 50 \
      --log_completions true \
      --num_iterations 1 \
      --beta 0.01 \
      --output_dir ./ds-base-grpo \
      --report_to tensorboard \
      > logs/log-grpo-test.log 2>&1 &

@hjh0119
Copy link
Collaborator

hjh0119 commented Jul 3, 2025

I've tested the script you provided using your commit, but it still raises the following error: TypeError: sequence item 0: expected str instance, list found.

    @@ -1039,8 +1039,16 @@ def _swift_encode(self, inputs: StdTemplateInputs):
        idx = all_tokens.index(single_token[0])
        bos_token = all_tokens[:idx]
        sep_token = all_tokens[idx + 1:]
Collaborator

`auto_add_bos` is False for Qwen models, so the encode logic for Qwen models will not reach this branch.

Contributor Author

@aacedar aacedar Jul 3, 2025


This function will not work for Qwen models, because the `QwenTemplateMeta` class in qwen.py sets `auto_add_bos = False`; for the deepseek model, however, an exception will be raised.

swift/llm/template/template/qwen.py

@dataclass
class QwenTemplateMeta(ChatmlTemplateMeta):
    default_system: Optional[str] = DEFAULT_SYSTEM
    auto_add_bos: bool = False
    stop_words: List[Word] = field(default_factory=lambda: ['<|endoftext|>'])
    agent_template: str = 'hermes'

@aacedar
Contributor Author

aacedar commented Jul 3, 2025

I've tested the script you provided using your commit, but it still raises the following error: TypeError: sequence item 0: expected str instance, list found.

You should fix the #4806 bug first; I have committed a PR for that bug.

@aacedar aacedar closed this Jul 7, 2025